Preprocessing¶

In [79]:
using CSV, DataFrames, Statistics, Plots, CategoricalArrays, Dates, Random, Gurobi, JuMP
In [80]:
data = CSV.read("MLOpt_data.csv", DataFrame)

y = data."No-show"

X = select(data, [:Gender_binary, :Age_std, :Scholarship, :Hipertension, :Diabetes, :Alcoholism, :Handcap, :SMS_received, :lead_time_std, :day_of_week, :appointment_month])
X_features = select(data, [:Gender_binary, :Age_std, :Scholarship, :Hipertension, :Diabetes, :Alcoholism, :Handcap, :SMS_received, :lead_time_std, :day_of_week, :appointment_month, :Neighbourhood])
110458×12 DataFrame
110433 rows omitted
RowGender_binaryAge_stdScholarshipHipertensionDiabetesAlcoholismHandcapSMS_receivedlead_time_stdday_of_weekappointment_monthNeighbourhood
Int64Float64Int64Int64Int64Int64Int64Int64Float64String15String7String31
111.07858010000-0.667653FridayAprilJARDIM DA PENHA
200.818894000000-0.667653FridayAprilJARDIM DA PENHA
311.07858000000-0.667653FridayAprilMATA DA PRAIA
41-1.25863000000-0.667653FridayAprilPONTAL DE CAMBURI
510.818894011000-0.667653FridayAprilJARDIM DA PENHA
611.68453010000-0.536562FridayAprilREPÚBLICA
71-0.6094000000-0.536562FridayAprilGOIABEIRAS
810.0831057000000-0.536562FridayAprilGOIABEIRAS
91-0.695964000000-0.667653FridayAprilANDORINHAS
101-0.782527000000-0.536562FridayAprilCONQUISTA
111-0.306429000000-0.536562FridayAprilNOVA PALESTINA
120-0.349711000001-0.471016FridayAprilNOVA PALESTINA
131-0.652682100000-0.602107FridayAprilNOVA PALESTINA
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
1104470-0.176584010000-0.536562WednesdayJuneMARIA ORTIZ
1104481-0.00345758000000-0.602107WednesdayJuneMARIA ORTIZ
1104491-0.782527000000-0.667653TuesdayJuneMARIA ORTIZ
11045010.5592040000012.01971TuesdayJuneMARIA ORTIZ
1104511-0.6526820000012.01971TuesdayJuneMARIA ORTIZ
11045210.2129510000011.62644TuesdayJuneMARIA ORTIZ
11045310.6890490000011.62644TuesdayJuneMARIA ORTIZ
11045410.8188940000011.62644TuesdayJuneMARIA ORTIZ
11045510.6024850000011.62644TuesdayJuneMARIA ORTIZ
1104561-0.6959640000012.01971TuesdayJuneMARIA ORTIZ
11045710.03982410000012.01971TuesdayJuneMARIA ORTIZ
11045810.732330000012.01971TuesdayJuneMARIA ORTIZ
In [81]:
X.Gender_binary = CategoricalArray(X.Gender_binary)
X.Scholarship = CategoricalArray(X.Scholarship)
X.Hipertension = CategoricalArray(X.Hipertension)
X.Diabetes = CategoricalArray(X.Diabetes)
X.Alcoholism = CategoricalArray(X.Alcoholism)
X.Handcap = CategoricalArray(X.Handcap)
X.SMS_received = CategoricalArray(X.SMS_received)
X.day_of_week = CategoricalArray(X.day_of_week)
X.appointment_month = CategoricalArray(X.appointment_month)

X_features.Gender_binary = CategoricalArray(X_features.Gender_binary)
X_features.Scholarship = CategoricalArray(X_features.Scholarship)
X_features.Hipertension = CategoricalArray(X_features.Hipertension)
X_features.Diabetes = CategoricalArray(X_features.Diabetes)
X_features.Alcoholism = CategoricalArray(X_features.Alcoholism)
X_features.Handcap = CategoricalArray(X_features.Handcap)
X_features.SMS_received = CategoricalArray(X_features.SMS_received)
X_features.day_of_week = CategoricalArray(X_features.day_of_week)
X_features.appointment_month = CategoricalArray(X_features.appointment_month)
X_features.Neighbourhood = CategoricalArray(X_features.Neighbourhood)
110458-element CategoricalArray{String31,1,UInt32}:
 String31("JARDIM DA PENHA")
 String31("JARDIM DA PENHA")
 String31("MATA DA PRAIA")
 String31("PONTAL DE CAMBURI")
 String31("JARDIM DA PENHA")
 String31("REPÚBLICA")
 String31("GOIABEIRAS")
 String31("GOIABEIRAS")
 String31("ANDORINHAS")
 String31("CONQUISTA")
 ⋮
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
 String31("MARIA ORTIZ")
In [82]:
seed = 15095
(X_trainvalid, y_trainvalid), (X_test, y_test) = IAI.split_data(:classification, X, y, seed=seed, train_proportion=0.8)
(X_train, y_train), (X_valid, y_valid) =  IAI.split_data(:classification, X_trainvalid, y_trainvalid, seed=seed, train_proportion=0.8);

(Xfeat_trainvalid, yfeat_trainvalid), (Xfeat_test, yfeat_test) = IAI.split_data(:classification, X, y, seed=seed, train_proportion=0.8)
(Xfeat_train, yfeat_train), (Xfeat_valid, yfeat_valid) =  IAI.split_data(:classification, Xfeat_trainvalid, yfeat_trainvalid, seed=seed, train_proportion=0.8);

Feature Selection¶

In [83]:
grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(
        random_seed=seed,
    ),
    sparsity=1:12,
)
IAI.fit!(grid, Xfeat_trainvalid, Array(yfeat_trainvalid), validation_criterion=:auc)
All Grid Results:

 Row │ sparsity  train_score  valid_score  rank_valid_score 
     │ Int64     Float64      Float64      Int64            
─────┼──────────────────────────────────────────────────────
   1 │        1    0.0308979     0.69331                  1
   2 │        2    0.0353047     0.656788                12
   3 │        3    0.038904      0.65837                 11
   4 │        4    0.0404003     0.659969                10
   5 │        5    0.0409848     0.660041                 9
   6 │        6    0.0413885     0.661043                 6
   7 │        7    0.0416255     0.661264                 5
   8 │        8    0.0416909     0.661485                 3
   9 │        9    0.041819      0.661758                 2
  10 │       10    0.0418773     0.660875                 7
  11 │       11    0.0419042     0.6613                   4
  12 │       12    0.0419735     0.660816                 8

Best Params:
  sparsity => 1

Best Model - Fitted OptimalFeatureSelectionClassifier:
  Constant: -1.38276
  Weights:
    lead_time_std:  0.384817
  (Higher score indicates stronger prediction for class `1`)
In [84]:
plot(grid, type=:validation)
No description has been provided for this image
In [85]:
IAI.variable_importance(IAI.get_learner(grid))
22×2 DataFrame
RowFeatureImportance
SymbolFloat64
1lead_time_std1.0
2Age_std0.0
3Alcoholism=10.0
4Diabetes=10.0
5Gender_binary=10.0
6Handcap=00.0
7Handcap=10.0
8Handcap=20.0
9Handcap=30.0
10Handcap=40.0
11Hipertension=10.0
12SMS_received=10.0
13Scholarship=10.0
14appointment_month=April0.0
15appointment_month=June0.0
16appointment_month=May0.0
17day_of_week=Friday0.0
18day_of_week=Monday0.0
19day_of_week=Saturday0.0
20day_of_week=Thursday0.0
21day_of_week=Tuesday0.0
22day_of_week=Wednesday0.0

Predictions¶

RF¶

In [86]:
# TODO Train a Random Forest 
# Grid search for hyperparameter tuning
max_depths = [15]
num_trees = [200]
minbuckets = [100, 200]
rf_grid = IAI.GridSearch(
    IAI.RandomForestClassifier(
        random_seed=seed,
        max_categoric_levels_before_warning=10000,
    ),
    max_depth=max_depths,
    num_trees=num_trees,
    minbucket=minbuckets,
)
# Fit training data
IAI.fit_cv!(rf_grid, X_trainvalid, Array(y_trainvalid), validation_criterion=:auc, n_folds=3)


# Calculate AUC score on the validation set (name the variable holding the AUC score as `val_auc`)
val_auc = IAI.score(rf_grid, X_valid, y_valid, criterion=:auc)
0.7589772832042827
In [87]:
# TODO: Get best model
model = rf_grid
lnr_rf = IAI.get_learner(model)

# TODO: Display selected model's parameters
params = IAI.get_params(lnr_rf)
println(params)
println("Max depth of the best model is ", params[:max_depth])
println("Number of trees of the best model is ", params[:num_trees])
println("Minbucket size of the best model is ", params[:minbucket])
Dict{Symbol, Any}(:fast_num_support_restarts => 0, :max_categoric_levels_before_warning => 10000, :split_features => All{Tuple{}}(()), :fast_test_intermediate_support => true, :num_trees => 200, :parallel_processes => nothing, :num_threads => nothing, :max_features => :auto, :show_progress => false, :checkpoint_max_files => nothing, :max_depth => 15, :hyperplane_config => NamedTuple[], :checkpoint_interval => 10, :regression_features => Any[], :weighted_minbucket => 1, :treat_unknown_level_missing => false, :random_seed => 15095, :checkpoint_dir => nothing, :fast_cumulative_support => true, :missingdatamode => :none, :refit_learner => nothing, :checkpoint_file => nothing, :fast_num_support_iterations => 1, :normalize_X => true, :bootstrap_samples => true, :criterion => :gini, :minbucket => 100, :cp_tuning_se_tolerance => 0.0, :cp => 0)
Max depth of the best model is 15
Number of trees of the best model is 200
Minbucket size of the best model is 100
In [88]:
# TODO: Calculate AUC score for test data
auc_score_rf = IAI.score(model, X_test, y_test, criterion=:auc)

# Get class predictions for test set
acc_rf = IAI.score(model, X_test, y_test, criterion=:misclassification)

println("Test AUC score: ", round(auc_score_rf, digits=4))
println("Test Accuracy: ", round(acc_rf, digits=4))
Test AUC score: 0.7331
Test Accuracy: 0.7978

CART¶

In [89]:
max_depths = [12,15,18]
minbuckets = [100,200]

grid_cart = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        localsearch = false,
    ),
    max_depth=max_depths,
    minbucket=minbuckets,
    criterion=:gini
)

IAI.fit_cv!(grid_cart, X_trainvalid, Array(y_trainvalid), validation_criterion=:auc, n_folds=3)
Optimal Trees Visualization
In [90]:
# TODO: Get best model
model = grid_cart
lnr_cart = IAI.get_learner(model)

# TODO: Display selected model's parameters
params = IAI.get_params(lnr_cart)
println(params)
println("Max depth of the best model is ", params[:max_depth])
println("Minbucket size of the best model is ", params[:minbucket])
Dict{Symbol, Any}(:localsearch => false, :fast_num_support_restarts => 0, :max_categoric_levels_before_warning => 10, :split_features => All{Tuple{}}(()), :fast_test_intermediate_support => true, :ls_ignore_errors => false, :ls_num_hyper_restarts => 5, :max_depth => 12, :parallel_processes => nothing, :num_threads => nothing, :hyperplane_config => NamedTuple[], :show_progress => false, :ls_warmstart_criterion => :gini, :checkpoint_interval => 10, :ls_max_hyper_iterations => 9223372036854775807, :regression_features => Any[], :checkpoint_max_files => nothing, :weighted_minbucket => 1, :ls_bootstrap_samples => false, :treat_unknown_level_missing => false, :random_seed => 15095, :checkpoint_dir => nothing, :ls_num_greedy_features => :auto, :fast_cumulative_support => true, :missingdatamode => :none, :refit_learner => nothing, :fast_num_support_iterations => 1, :checkpoint_file => nothing, :normalize_X => true, :ls_num_categoric_restarts => 10, :criterion => :gini, :minbucket => 200, :ls_max_search_iterations => 9223372036854775807, :ls_scan_reverse_split => false, :ls_num_tree_restarts => 100, :cp_tuning_se_tolerance => 0.0, :cp => 0.0002150480321117902)
Max depth of the best model is 12
Minbucket size of the best model is 200
In [91]:
# TODO: Calculate AUC score for test data
auc_score_cart = IAI.score(model, X_test, y_test, criterion=:auc)

# Get class predictions for test set
acc_cart = IAI.score(model, X_test, y_test, criterion=:misclassification)

println("Test AUC score: ", round(auc_score_cart, digits=4))
println("Test Accuracy: ", round(acc_cart, digits=4))
Test AUC score: 0.724
Test Accuracy: 0.7981

OCT¶

In [92]:
# Train an OCT, this grid search might take some time
# Grid search for hyperparameter tuning
# TODO: 
max_depths = [4,6,8]
minbuckets = [100, 200]

grid_oct = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        max_categoric_levels_before_warning=10000,
    ),
    max_depth=max_depths,
    minbucket=minbuckets,
    criterion=:gini,
) 

IAI.fit_cv!(grid_oct, X_trainvalid, Array(y_trainvalid), validation_criterion=:auc, n_folds=3)
Optimal Trees Visualization
In [93]:
# TODO: Get best model
model = grid_oct
lnr_oct = IAI.get_learner(model)

# TODO: Display selected model's parameters
params = IAI.get_params(lnr_oct)
println(params)
println("Max depth of the best model is ", params[:max_depth])
println("Minbucket size of the best model is ", params[:minbucket])
Dict{Symbol, Any}(:localsearch => true, :fast_num_support_restarts => 0, :max_categoric_levels_before_warning => 10000, :split_features => All{Tuple{}}(()), :fast_test_intermediate_support => true, :ls_ignore_errors => false, :ls_num_hyper_restarts => 5, :max_depth => 6, :parallel_processes => nothing, :num_threads => nothing, :hyperplane_config => NamedTuple[], :show_progress => false, :ls_warmstart_criterion => :gini, :checkpoint_interval => 10, :ls_max_hyper_iterations => 9223372036854775807, :regression_features => Any[], :checkpoint_max_files => nothing, :weighted_minbucket => 1, :ls_bootstrap_samples => false, :treat_unknown_level_missing => false, :random_seed => 15095, :checkpoint_dir => nothing, :ls_num_greedy_features => :auto, :fast_cumulative_support => true, :missingdatamode => :none, :refit_learner => nothing, :fast_num_support_iterations => 1, :checkpoint_file => nothing, :normalize_X => true, :ls_num_categoric_restarts => 10, :criterion => :gini, :minbucket => 200, :ls_max_search_iterations => 9223372036854775807, :ls_scan_reverse_split => false, :ls_num_tree_restarts => 100, :cp_tuning_se_tolerance => 0.0, :cp => 1.3361958865643064e-5)
Max depth of the best model is 6
Minbucket size of the best model is 200
In [94]:
# TODO: Calculate AUC score for test data
auc_score_oct = IAI.score(model, X_test, y_test, criterion=:auc)

# Get class predictions for test set
acc_oct = IAI.score(model, X_test, y_test, criterion=:misclassification)

println("Test AUC score: ", round(auc_score_oct, digits=4))
println("Test Accuracy: ", round(acc_oct, digits=4))
Test AUC score: 0.7253
Test Accuracy: 0.7981

Prescription¶

In [97]:
const C_IDLE  = 150.0
const C_DELAY = 75.0

# y_train: 1 = no-show, 0 = show
function leaf_costs_from_training_tree(lnr::IAI.TreeLearner,
                                       X_train,
                                       y_train;
                                       c_idle::Float64 = C_IDLE,
                                       c_delay::Float64 = C_DELAY)

    node_inds = IAI.apply_nodes(lnr, X_train)
    num_nodes = length(node_inds)

    leaf_cost0  = zeros(Float64, num_nodes)
    leaf_cost1  = zeros(Float64, num_nodes)
    leaf_p_show = zeros(Float64, num_nodes)

    for t in 1:num_nodes
        inds = node_inds[t]
        if isempty(inds) || !IAI.is_leaf(lnr, t)
            continue
        end

        n_leaf = length(inds)
        num_no   = sum(y_train[inds] .== 1)
        num_show = n_leaf - num_no

        frac_no   = num_no   / n_leaf
        frac_show = num_show / n_leaf

        leaf_cost0[t]  = c_idle  * frac_no
        leaf_cost1[t]  = c_delay * frac_show
        leaf_p_show[t] = frac_show
    end

    return leaf_cost0, leaf_cost1, leaf_p_show
end
leaf_costs_from_training_tree (generic function with 1 method)
In [98]:
function prescribe_tree_with_gurobi(lnr::IAI.TreeLearner,
                                    X_train,
                                    y_train,
                                    X_test;
                                    c_idle::Float64 = C_IDLE,
                                    c_delay::Float64 = C_DELAY)

    leaf_cost0, leaf_cost1, leaf_p_show =
        leaf_costs_from_training_tree(lnr, X_train, y_train;
                                      c_idle=c_idle, c_delay=c_delay)

    leaf_test = IAI.apply(lnr, X_test)
    n_test = length(leaf_test)

    model = Model(Gurobi.Optimizer)
    set_silent(model)

    @variable(model, z[1:n_test], Bin)   

    @objective(model, Min,
        sum( leaf_cost0[leaf_test[j]] * (1 - z[j]) +
             leaf_cost1[leaf_test[j]] * z[j]
             for j in 1:n_test)
    )

    optimize!(model)

    z_opt = Int.(round.(value.(z)))

    cost_base = [leaf_cost0[leaf_test[j]] for j in 1:n_test]
    cost_pol  = [leaf_cost0[leaf_test[j]] * (1 - z_opt[j]) +
                 leaf_cost1[leaf_test[j]] * z_opt[j]
                 for j in 1:n_test]

    avg_cost_base = mean(cost_base)
    avg_cost_pol  = mean(cost_pol)
    savings       = avg_cost_base - avg_cost_pol

    seen_base = Float64[]
    seen_pol  = Float64[]
    for j in 1:n_test
        p = leaf_p_show[leaf_test[j]]    
        push!(seen_base, p)              
        b = 1 + z_opt[j]                 
        push!(seen_pol, 1 - (1 - p)^b)   
    end

    avg_seen_base = mean(seen_base)
    avg_seen_pol  = mean(seen_pol)
    extra_seen    = avg_seen_pol - avg_seen_base

    return (
        z_opt         = z_opt,
        avg_cost_base = avg_cost_base,
        avg_cost_pol  = avg_cost_pol,
        savings       = savings,
        avg_seen_base = avg_seen_base,
        avg_seen_pol  = avg_seen_pol,
        extra_seen    = extra_seen,
    )
end
prescribe_tree_with_gurobi (generic function with 1 method)
In [99]:
function prescribe_rf_with_gurobi(rf_lnr,
                                  X_test;
                                  c_idle::Float64 = C_IDLE,
                                  c_delay::Float64 = C_DELAY)

    proba = IAI.predict_proba(rf_lnr, X_test)
    p_no   = proba[:, 2]           
    p_show = 1 .- p_no             

    n_test = length(p_show)

    cost0 = c_idle  .* (1 .- p_show)
    cost1 = c_delay .* p_show

    model = Model(Gurobi.Optimizer)
    set_silent(model)

    @variable(model, z[1:n_test], Bin)

    @objective(model, Min,
        sum(cost0[j] * (1 - z[j]) + cost1[j] * z[j] for j in 1:n_test)
    )

    optimize!(model)

    z_opt = Int.(round.(value.(z)))

    cost_base = cost0                 
    cost_pol  = [cost0[j] * (1 - z_opt[j]) + cost1[j] * z_opt[j]
                 for j in 1:n_test]

    avg_cost_base = mean(cost_base)
    avg_cost_pol  = mean(cost_pol)
    savings       = avg_cost_base - avg_cost_pol

    seen_base = p_show                            
    seen_pol  = [1 - (1 - p_show[j])^(1 + z_opt[j]) for j in 1:n_test]

    avg_seen_base = mean(seen_base)
    avg_seen_pol  = mean(seen_pol)
    extra_seen    = avg_seen_pol - avg_seen_base

    return (
        z_opt         = z_opt,
        avg_cost_base = avg_cost_base,
        avg_cost_pol  = avg_cost_pol,
        savings       = savings,
        avg_seen_base = avg_seen_base,
        avg_seen_pol  = avg_seen_pol,
        extra_seen    = extra_seen,
    )
end
prescribe_rf_with_gurobi (generic function with 1 method)
In [100]:
# Best learners from your grids
oct_lnr  =lnr_oct
cart_lnr =lnr_cart
rf_lnr   =lnr_rf

res_oct  = prescribe_tree_with_gurobi(oct_lnr,  X_train, y_train, X_test)
res_cart = prescribe_tree_with_gurobi(cart_lnr, X_train, y_train, X_test)
res_rf   = prescribe_rf_with_gurobi(rf_lnr, X_test)

println("=== Expected Cost (BRL per slot) ===")
println("OCT:   baseline = ", round(res_oct.avg_cost_base,  digits=2),
        ", policy = ",        round(res_oct.avg_cost_pol,   digits=2),
        ", savings = ",       round(res_oct.savings,        digits=2))

println("CART:  baseline = ", round(res_cart.avg_cost_base, digits=2),
        ", policy = ",        round(res_cart.avg_cost_pol,  digits=2),
        ", savings = ",       round(res_cart.savings,       digits=2))

println("RF:    baseline = ", round(res_rf.avg_cost_base,   digits=2),
        ", policy = ",        round(res_rf.avg_cost_pol,    digits=2),
        ", savings = ",       round(res_rf.savings,         digits=2))

println("\n=== Expected Patients Seen per Slot ===")
println("OCT:   baseline = ", round(res_oct.avg_seen_base,  digits=4),
        ", policy = ",        round(res_oct.avg_seen_pol,   digits=4),
        ", extra = ",         round(res_oct.extra_seen,     digits=4))

println("CART:  baseline = ", round(res_cart.avg_seen_base, digits=4),
        ", policy = ",        round(res_cart.avg_seen_pol,  digits=4),
        ", extra = ",         round(res_cart.extra_seen,    digits=4))

println("RF:    baseline = ", round(res_rf.avg_seen_base,   digits=4),
        ", policy = ",        round(res_rf.avg_seen_pol,    digits=4),
        ", extra = ",         round(res_rf.extra_seen,      digits=4))
Set parameter Username
Set parameter LicenseID to value 2710905
Academic license - for non-commercial use only - expires 2026-09-20
Set parameter Username
Set parameter LicenseID to value 2710905
Academic license - for non-commercial use only - expires 2026-09-20
Set parameter Username
Set parameter LicenseID to value 2710905
Academic license - for non-commercial use only - expires 2026-09-20
=== Expected Cost (BRL per slot) ===
OCT:   baseline = 30.24, policy = 28.06, savings = 2.18
CART:  baseline = 30.35, policy = 28.63, savings = 1.72
RF:    baseline = 30.32, policy = 28.44, savings = 1.88

=== Expected Patients Seen per Slot ===
OCT:   baseline = 0.7984, policy = 0.8428, extra = 0.0444
CART:  baseline = 0.7977, policy = 0.8386, extra = 0.0409
RF:    baseline = 0.7978, policy = 0.8367, extra = 0.0388

Set parameter LicenseID to value 2710905
Academic license - for non-commercial use only - expires 2026-09-20
Set parameter Username
Set parameter LicenseID to value 2710905
Academic license - for non-commercial use only - expires 2026-09-20
Set parameter Username
Set parameter LicenseID to value 2710905
Academic license - for non-commercial use only - expires 2026-09-20
=== Expected Cost (BRL per slot) ===
OCT:   baseline = 30.24, policy = 28.06, savings = 2.18
CART:  baseline = 30.35, policy = 28.63, savings = 1.72
RF:    baseline = 30.32, policy = 28.44, savings = 1.88

=== Expected Patients Seen per Slot ===
OCT:   baseline = 0.7984, policy = 0.8428, extra = 0.0444
CART:  baseline = 0.7977, policy = 0.8386, extra = 0.0409
RF:    baseline = 0.7978, policy = 0.8367, extra = 0.0388